Shumway, R.; Stoffer, D; Time Series Analysis and Its Applications with R examples.
Framework and packages:
We will work with the fpp3 package, developed by Prof. Rob Hyndman and his team. They are actually one of the leading contributing groups to the time series community, so do not forget their name.
This package extends the tidyverse framework to time series. For that it requires to load some packages under the hood (see output of the following cell):
library(fpp3)
Additionally, for this particular session we will use the packages GGally, fma and patchwork. Install them if you do not have them already. Remember that the command install.packages is to be run only once, not every time you want to use the package.
install.packages("GGally", dependencies =TRUE) # to generate a scatterplot matrixinstall.packages("fma", dependencies =TRUE) # to load the Us treasury bills datasetinstall.packages("patchwork", dependencies =TRUE) # Used to manage the relative location of ggplots
Once you have installed the packages with the install.packages command, you may use them simply by loadeing them with the library command. You have to do this every time you restart your R session.
library(patchwork) # Used to manage the relative location of ggplotslibrary(GGally)library(fma) # to load the Us treasury bills dataset
Time plots and basic time series patterns
Time series patterns
Pattern
Description
Trend
Long-term increase OR decrease in the data
Seasonal
Periodic pattern due to the calendar (quarter, month, day of the week...)
Cyclic
Pattern where data exhibits rises and falls that are NOT FIXED IN PERIOD (typically duration of at least 2 years)
Seasonal or cyclic
Seasonal
Cyclic
Seasonal pattern of constant length
Cyclic patern of variable length
Shorter than cyclic pattern
Longer than periodic pattern
Magnitude less variable than cyclic pattern
Magnitude more variable than periodic pattern
Let us look at some examples:
Example 1 - trended and seasonal data
scale_x_yearquarter: adjusting the grid when the time index follows a `yearquarter object
Inspecting the dataset aus_production we note two things:
As all the datasets loaded along the fpp3 package, it comes in tsibble format.
The column signaling time (the index of the tsibble) is of type `yearquarter``
Because the index of the tsibble is of type yearquarter, we will correspondingly use the function scale_x_yearquarter when we want to scale the x-axis of the timeplot below
aus_production %>%# Filter data to show appropriate timeframefilter((year(Quarter) >=1980&year(Quarter)<=2000)) %>%# autoplot: generates a simple timeplot, yet to be adjusted with# further commands.autoplot(Electricity) +# Scale the x axis adequately# scale_x_yearquarter used because the index is a yearquarter columnscale_x_yearquarter(date_breaks ="1 year",minor_breaks ="1 year") +# Flip x-labels by 90 degreestheme(axis.text.x =element_text(angle =90))
Strong trend with a change in the slope of the trend around 1990.
autoplot() as a wrapper around ggplot2 objects
The function autoplot() of the library fpp3 is nothing but a wrapper around ggplot2 objects. Depending on the object fed to autoplot, it will have one or other behavior, combining different ggplot objects.
In the example above, autoplot() is fed a time series in tsibble format. In this situation, it acts as a wrapper around the ggplot command geom_line(). It is just that using autoplot() is much more convenient and it is a good idea to have such a function in a time series dedicated library such as fpp3. The code below uses explicitly all ggplot commands and attains the exact same output as using autoplot.
aus_production %>%filter((year(Quarter) >=1980&year(Quarter)<=2000)) %>%# The two lines below are equivalent to autoplot(Electricity)ggplot(aes(x = Quarter, y = Electricity)) +geom_line() +# Scale the x axis adequately# scale_x_yearquarter used because the index is a yearquarter columnscale_x_yearquarter(date_breaks ="1 year",minor_breaks ="1 year") +# Flip x-labels by 90 degreestheme(axis.text.x =element_text(angle =90))
Example 2 - exercise - trended and seasonal data with recessions
Scale the x-grid of the following graph so that the end of each year is clearly visible.
HINT: remember you need to inspect the dataset aus_production and check the type of its index column (time column) to pick the right function to scale the x axis.
ustreas is a ts object (time-series). This is another possible format for time series in R, but remember we are going to use tsibles to allow for seamless functionality with the fpp3 library. You can convert the ts to a tsibble using as_tsibble in the following manner:
# A tsibble: 100 x 2 [1]
index value
<dbl> <dbl>
1 1 91.6
2 2 91.6
3 3 91.4
4 4 91.4
5 5 91.2
6 6 90.9
7 7 91.0
8 8 91.2
9 9 91.0
10 10 91.1
# ℹ 90 more rows
autoplot(ustreas_tsibble)
Plot variable not specified, automatically selected `.vars = value`
Looks like a downward trend, but it is part of a much longer cycle (not shown). You do not want to make forecasts with this data too long into the future.
scale_x_continuous: adjusting the grid when the index is numeric.
Examining the tsibble ustreas_tsibble reveals that the index is numeric (trading day) (“dbl”, meaning double precision floating point).
ustreas_tsibble %>%head(5)
# A tsibble: 5 x 2 [1]
index value
<dbl> <dbl>
1 1 91.6
2 2 91.6
3 3 91.4
4 4 91.4
5 5 91.2
If the index is of type numeric we need to use the function scale_x_continuous to properly adjust the x-axis. In addition to this, we need to generate a specific sequence signaling the ticks. For example, if we want major ticks every 10 days and minor ticks every 5 days:
# Sequence for major ticks, from 0 to the maximum index in the datamajor_ticks_seq =seq(0, max(ustreas_tsibble$index), 10)major_ticks_seq
[1] 0 10 20 30 40 50 60 70 80 90 100
# Sequence for minor ticks, from 0 to the maximum index in the dataminor_ticks_seq =seq(0, max(ustreas_tsibble$index), 5)minor_ticks_seq
Plot variable not specified, automatically selected `.vars = value`
Example 4 - exercise. Yearly data and cyclic behavior
ANUAL DATA is not susceptible to within-year calendar periods (so called “seasonal periods”). It is sampled at the same point every year. Therefore, within-year calendar periods do not affect yearly data.
In other words: there are no seasonal patterns in yearly data. If anything there could be cyclic behavior.
Hudson Bay Company trading records for Snowshoe Hare and Canadian Lynx furs from 1845 to 1935. This data contains trade records for all areas of the company. We are going analyse the number of Canadian Lynx pelts traded.
pelt %>%head(5)
# A tsibble: 5 x 3 [1Y]
Year Hare Lynx
<dbl> <dbl> <dbl>
1 1845 19580 30090
2 1846 19600 45150
3 1847 19610 49150
4 1848 11990 39520
5 1849 28040 21230
lynx <- pelt %>%select(Year, Lynx)lynx %>%head(5)
# A tsibble: 5 x 2 [1Y]
Year Lynx
<dbl> <dbl>
1 1845 30090
2 1846 45150
3 1847 49150
4 1848 39520
5 1849 21230
Exercise
Plot the time series ensuring we have major ticks every 10 years and minor ticks every 5 years.
Hint: check the type of the index sequence to select the correct variation of the family of functions scale_x_...() you need to properly scale the x-axis.
Plot variable not specified, automatically selected `.vars = Lynx`
Number of lynx trapped anualy in the Hudson bay reach* BECAUSE THIS IS ANUAL DATA, IT CANNOT BE SEASONAL* Population of lynx raises when there is plenty of food+ When food supply is low -> population plummets+ Surviving lynx have excess food -> population raises
Example 5 - exercise - Cyclic + seasonal behavior
Monthly sales of new one-family houses sold in the USA since 1973.
The ts object hsales is loaded along the fma library
hsales_tsibble <- hsales %>%as_tsibble()
Create a timeplot that has major ticks every 5 years and minor ticks every year
Hint: check the type of the index sequence to select the correct variation of the family of functions scale_x_...() you need to properly scale the x-axis. The one you need here has not been used in previous examples, but it is consistent with the type of the index.
hsales_tsibble %>%autoplot() +scale_x_yearmonth(date_breaks ="5 year",minor_breaks ="1 year")
Plot variable not specified, automatically selected `.vars = value`
Example 6 - multiple seasonality
Half-hourly electricity demand in England and Wales from Monday 5 June 2000 to Sunday 27 August 2000. Discussed in Taylor (2003), and kindly provided by James W Taylor. Units: Megawatts
Again taylor is a time-series object (ts)
class(taylor)
[1] "msts" "ts"
However, since the series is sampled every half-hour (that is, we are dealing with datetime info), usiung as_tsibble() directly does not adequately convert this time series to a tsibble:
# Column index represented as a double! This is not what we want.taylor %>%as_tsibble()
# A tsibble: 4,032 x 2 [9.9999999960505e-07]
index value
<dbl> <dbl>
1 1 22262
2 1.00 21756
3 1.01 22247
4 1.01 22759
5 1.01 22549
6 1.01 22313
7 1.02 22128
8 1.02 21860
9 1.02 21751
10 1.03 21336
# ℹ 4,022 more rows
To adequately transform the taylor ‘ts’ object to a tsibble, we are going to define a sequence of date-time objects that match the length of the taylor object. Based on the description and assuming the measurements on the 5th of June 2000 start at midnight, the code below turns the ts object into a tsibble. It is not necessary that you fully understand this code, but it is explained in detail:
# Define the start time based on information about the Taylor series# Assume UTC time.start_time <-as.POSIXct("2000-06-05 00:00:00", tz ="UTC")# We specify by = 1800 because the lowest unit we specified in our datetime object# start_time are seconds (hh:mm:ss). The samples in this ts object are taken every# half an hour, that is, every 1800 seconds (30 * 60)# We specify as well that the sequence needs to be of the same length as the taylor# objectdatetime_seq <- start_time +seq(from =0, by =1800, length.out =length(taylor))# Convert the time series to a tsibbletaylor_tsibble <-tibble(index = datetime_seq,value =as.numeric(taylor) ) %>%as_tsibble(index = index)# Check the resulting tsibbletaylor_tsibble
Plot variable not specified, automatically selected `.vars = value`
Two types of seasonality
Daily pattern
Weekly pattern
If we had data over a few years: we would see annual patern
If we had data over a few decades: we might see a longer cyclic pattern
To inspect these patterns more closely, I am going to filter for the first four weeks in the time series. To attain this, I am going to create an auxiliary week column and then filter based on that column:
# Create auxiliary column signalling the yearweek corresponding to# every point in the time seres.taylor_tsibble <- taylor_tsibble %>%mutate(week =yearweek(index) )# Extract first week and compute 4th weekweek1 <- taylor_tsibble$week[1]week4 <- week1 +3# Filter for first four weeks and store in new objecttaylor_tsibble_4w <- taylor_tsibble %>%filter( week >= week1, week <= week4 )# Depict resulttaylor_tsibble_4w %>%autoplot() +# Used because index is date-timescale_x_datetime(breaks ="1 week",minor_breaks ="1 day" )
Plot variable not specified, automatically selected `.vars = value`
Here we can see the patterns more closely.
IMPORTANT: the demand of electricity does not always follow such an extremely regular pattern. There might be weekly seasonality, but it is not necessarily so regular. See the next example.
Example 7 - exercise
We are going to use the dataset vic_elec from the fpp3 library. It provides half-hourly electricity demand in for Victoria (Australia).
Plot variable not specified, automatically selected `.vars = Demand`
We can spot a daily pattern but this time there is not a clear weekly pattern.The data does not necessarily exhibit all the possible seasonal patterns.
Example 8 - exercise
scale_x_date: indices that are date objects
Let us begin by creating a daily time series out of the previos elec_jan (see example 7) using index_by to compute the total electricity consumption within each day
Plot variable not specified, automatically selected `.vars = daily_demand`
Example 9 - Trend + seasonality (multiplicative)
PBS contains monthly medicare australia prescription data. We are going to select those regarding drugs of the category ATC2. For further info run ?PBS in the console.
PBS %>%filter(ATC2 =="A10") %>%select(Month, Concession, Type, Cost) %>%summarise(TotalC =sum(Cost)) %>%# Comment on this assignment operatormutate(Cost = TotalC /1e6) -> a10
a10 %>%head(5)
# A tsibble: 5 x 3 [1M]
Month TotalC Cost
<mth> <dbl> <dbl>
1 1991 Jul 3526591 3.53
2 1991 Aug 3180891 3.18
3 1991 Sep 3252221 3.25
4 1991 Oct 3611003 3.61
5 1991 Nov 3565869 3.57
autoplot(a10, Cost) +labs(y ="$ (millions)",title ="Australian antidiabetic drug sales") +scale_x_yearmonth(breaks ="1 year") +theme(axis.text.x =element_text(angle =90))
Scatter plots
Two variables
The dataset vic_elec, which contains half-hourly data for the demand of electricity and average temperature in the state of Victoria (Australia).
Let us depict a timeplot for the demand and a timeplot for the average temperature. The code snippet below requires you to load the library GGally for one statement to run (comment in the code):
For every point in time, we may now depict the electricity demand and the corresponding average temperature, resulting in a scatterplot of these two variables instead of a timeplot. In this scatterplot we also add two trend lines: - A linear trend line following ordinary least squares. - A non-linear trend line following a loess model
vic_elec %>%filter(year(Time) ==2014) %>%ggplot(aes(x = Temperature, y = Demand)) +geom_point() +# Linear trend linegeom_smooth(method ="lm", se =FALSE) +# Non-linear trend linegeom_smooth(method ="loess", color ="red", se =FALSE) +# Labelslabs(x ="Temperature (degrees Celsius)",y ="Electricity demand (GW)")
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Just by looking at the graph, we can see that the linear trend line does not capture the entire strignth of the relationship between demand and temperature.
Correlation coefficients - careful interpretation
In your previous courses on statistics, you have been introduced to the correlation coefficient between two variables X and Yand you have also derived (hopefully) the relationship between this correlation coefficient and a linear trend line between X and Y obtained using ordinary least squares (ols):
TODO: INCLUDE EQUATION OF PEARSONS CORRELATION COEFFICIENT
TODO: INCLUDE EQUATION OF HOW THE LINEAR TREND BE
The equation for that linear trend line tells us that, the correlation coefficient alone only accounts for the linear relationship between X and Y but it does not measure any non-linear relationships.
Important example 1 - correlation coefficient of 0
From the equation relating the correlation coefficient and the slope of the linear model, we can see that, if the linear relationshp is 0 (slope of the trend \(=\) 0) \(\rightarrow\)\(r_{xy}=0\). However there may be other non-linear relationships not captured by the correlation coefficient. The correlation coefficient only measures the strength of linear relationship.
Let us see a specific toy example that will should keep in your head forever.In this, we take some points out of the function $y=x^2$. We take as many points to the left than points to the right, located at symmetrical positions around the y axis.
df <-tibble(x =seq(-4, 4, 0.05),y = x^2 )df %>%ggplot(aes(x = x, y = y)) +geom_point() +geom_smooth(method ="lm", se =FALSE)
`geom_smooth()` using formula = 'y ~ x'
In the figure we can see that the slope of the linear relationship is 0, and, as a result, the correlation coefficient is also 0:
# **QUESTION**: Why is it not exactly 0?cor(df$x, df$y)
[1] 1.494299e-16
ANSWER* Its a matter of how the computer handles numbers.* An output with that exponent is 0for our intents and purposes.* You need to get used to recognising this kind of scenarios.
But we know for a fact that there is a non-linear relationship between the variables (we have defined the points using it)..
Important example 2 - correlatoin
If the linear relationship is not 0 (slope of the trend line \(\neq\) 0) \(\rightarrow\)\(r_{xy}\neq0\). But there may be a stronger non-linear relationship as in the case of the electricity demand.
For example, the correlation for the electricity demand and temperature data shown in the previous figure is only 0.26, but their non-linear relationship is actually much stronger (this can be seen comparing the linear model and the loess model fitted in the previous scatterplot)
Remember that a correlation coefficient of 0 does not imply that there is no relationship between two variables. It just implies that there is no linear relationship.
Note: beware that there are very bad sources on statistics and data science on the internet:
Other examples of variables with a correlation coefficient of 0:
Examples of corr coefficient = 0 (Wikipedia)
Correlation coefficient vs. shape of the relationship
Remember as well that two variables may have a very similar correlation coefficent yet have a very different aspect. The example below (ref [1]) shows scatterplots of different variables, all of which have a correlation coefficient of 0.82. The set of four images below above is known as Ancombe’s quartet.
Examples of variables having the same correlation coefficient. From [1]
Finally, the animation below is intended to show you that, given a correlation coefficient and a number \(n\) of datapoints, there are infinitely many orderings of those datapoints that result in the same correlation coefficient. In other words, the correlation coefficient is relevant, but it is not very informative of the distribution of the points.
Datasaurus - from ref. 3
All these examples are meant to make you aware of how important it is to depict a scatterplot of the variables and not to rely only on the correlation coefficient, which is a summary only of their linear relationship.
The code snippet below loads the dataset soi_recuitment. Run the code and select the file soi_recruitment.csv when prompted to pick up a file. You will find this in the folder ZZ_Datasets, on google drive. This dataset contains two variables and has been taken from reference 4.
SOI: Southern Oscillation Index (SOI) for a period of 453 months ranging over the years 1950-1987. The Southern Oscillation Index (SOI) is a standardized index based on the observed sea level pressure (SLP) differences between Tahiti and Darwin, Australia. The SOI is one measure of the large-scale fluctuations in air pressure occurring between the western and eastern tropical Pacific (i.e., the state of the Southern Oscillation) during El Niño and La Niña weather episodes. For more information about this variable, see link
Recruitment: Recruitment (index of the number of new fish) for a period of 453 months ranging over the years 1950-1987. Recruitment is loosely defined as an indicator of new members of a population. The data has been gathered by Dr. Roy Mendelssohn, from the Pacific Environmental Fisheries Group.
Lagged variables and their importance in time series
We are going to delve in this concept further in the next classes, but it is important to introduce it now.
Given a variable \(x_t\), its lags correspond to delayed versions of the variable, with the number of the lag indicating the time delay:
The first lag of a variable \(x_t\) is written as \(x_{t-1}\). At any point in time \(t\), the value of \(x_{t-1}\) is the value of \(x_t\) one timestep ago.
The second lag of a variable \(x_t\) is written as \(x_{t-2}\). At any point in time \(t\), the value of \(x_{t-2}\) is the value of \(x_t\) two timestep ago.
…
The n-th lag of a variable \(x_t\) is written as \(x_{t-n}\). At any point in time \(t\), the value of \(x_{t-n}\) is the value of \(x_t\)\(n\) timesteps ago.
Lagged variables are relevant in time series for multiple reasons. Primarily because:
By studying the relationship between the lags {\(x_{t-1}\), \(x_{t-2}\), \(\cdots\), \(x_{t-n}\)} and \(x_t\), we can **understand if the variable \(x_t\) is related to its values some timesteps ago.
By studying the relationship between the lags {\(x_{t-1}\), \(x_{t-2}\), \(\cdots\), \(x_{t-n}\)} and a second variable \(y_t\), we can understand if the variable \(y_t\) is related to what happened in the past in the variable x.
A typical business case: current sales might be related to investment in marketing some timesteps ago.
As an example, the code below computes the first 8 lags of the variable SOI:
# A tsibble: 453 x 11 [1M]
ym recruitment SOI SOI_l1 SOI_l2 SOI_l3 SOI_l4 SOI_l5 SOI_l6 SOI_l7
<mth> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1950 Jan 68.6 0.377 NA NA NA NA NA NA NA
2 1950 Feb 68.6 0.246 0.377 NA NA NA NA NA NA
3 1950 Mar 68.6 0.311 0.246 0.377 NA NA NA NA NA
4 1950 Apr 68.6 0.104 0.311 0.246 0.377 NA NA NA NA
5 1950 May 68.6 -0.016 0.104 0.311 0.246 0.377 NA NA NA
6 1950 Jun 68.6 0.235 -0.016 0.104 0.311 0.246 0.377 NA NA
7 1950 Jul 59.2 0.137 0.235 -0.016 0.104 0.311 0.246 0.377 NA
8 1950 Aug 48.7 0.191 0.137 0.235 -0.016 0.104 0.311 0.246 0.377
9 1950 Sep 47.5 -0.016 0.191 0.137 0.235 -0.016 0.104 0.311 0.246
10 1950 Oct 50.9 0.29 -0.016 0.191 0.137 0.235 -0.016 0.104 0.311
# ℹ 443 more rows
# ℹ 1 more variable: SOI_l8 <dbl>
Exercise: generate the scatterplot matrix below
Question: do you detect any non-linear relationship between recruitment and SOI or any of its lags soi_li that is not adequately captured by the corr. coefficient?
# Count number of columnsncols <-length(names(soi_recruitment))# Generate scatterplot matrixsoi_recruitment %>% GGally::ggpairs(columns =2:ncols, lower =list(continuous =wrap("smooth_loess", color="lightblue", se=TRUE))) +theme(axis.text.x =element_text(angle =90))
The relationships between recruitment and SOI and its different lags seem fairlylinear, as the smoothing lines show. The correlation coefficient seems tocapture them properly in this case.## IMPORTANT NOTE: LEADING VS LAGGING VARIABLESFor soi_l5 to soi_l8 the relationship is stronger. This indicates that the soiseries "leads" the recruitment series. That is, what happened some timesteps agoin the soi series (t-5, t-6, t-7, t-8...) is correlated to what happens at timet in the recruitment series.Conversely, the variable recruitment "lags" the variable soi. What happenedon the current timestap in recruitment is related to the value of soi sometimestamps ago.These concepts will be more clear when we discuss autocorrelation andcrosscorrelation
Seasonal plots - gg_season
Difference with a time-plot: now the x-axis does not extend indefinately over time, but rather only over the duration of a season. The time axis is limited to a single season. In most of the situations we are going to handle in this course, that length of the season will be a year.
The period of a time series is always of greater order than the sampling frequency of the time series.
Example 1: with quarterly data, the natural seasonal period to consider is the immediatly superior calendar unit, which is a year.
Example 2: with monthly data, the natural seasonal period to consider are the immediatly superior calendar units, either a quarter or a year. In most instances we will consider a year when dealing with monthly data:
Example 3: with daily data, multiple seasonal periods can be considered: week, month, year…
Example 4: with sub-daily data, the day itself could be considered as a seasonal period…
…
Let us consider again the example of the antidiabetic drug sales in Australia. The dataset PBS contains monthly medicare australia prescription data. We are going to select those regarding drugs of the category ATC2. For further info run ?PBS in the console.
PBS %>%filter(ATC2 =="A10") %>%select(Month, Concession, Type, Cost) %>%summarise(TotalC =sum(Cost)) %>%# Comment on this assignment operatormutate(Cost = TotalC /1e6) -> a10a10
# A tsibble: 204 x 3 [1M]
Month TotalC Cost
<mth> <dbl> <dbl>
1 1991 Jul 3526591 3.53
2 1991 Aug 3180891 3.18
3 1991 Sep 3252221 3.25
4 1991 Oct 3611003 3.61
5 1991 Nov 3565869 3.57
6 1991 Dec 4306371 4.31
7 1992 Jan 5088335 5.09
8 1992 Feb 2814520 2.81
9 1992 Mar 2985811 2.99
10 1992 Apr 3204780 3.20
# ℹ 194 more rows
# Let us remember the time-plotautoplot(a10, Cost) +labs(y ="$ (millions)",title ="Australian antidiabetic drug sales") +scale_x_yearmonth(breaks ="1 year") +theme(axis.text.x =element_text(angle =90))
In this case we have quarterly data, so the natural period to consider is a year, which covers all the calendar variations and is of higher order than the data considered. The time-plot seems to confirm this as well. We will deal with seasonality and periods in more detail in the coming lessons through the concept of autocorrelation.
Now ley us produce a seasonal plot using gg_season
As you can see, the length of the x-axis is limited to the duration of a year. Then each year is represented and color coded using a gradient scale. On this plot we can see that there are jumps in the general level of the series each year. We also see a fiest decline from January to Febrary and a subsequent recovery throughout the rest of the year
Multiple seasonal periods
The dataset vic_elec, loaded along with fpp3, contains half-hourly electricity demand for the state of Victoria (Australia) for three years. Because data is sampled every half-hour, we can find multiple seasonal patterns in the data:
Daily pattern (period 48)
Weekly pattern
Yearly pattern…
Daily period example
Let us examine when the data begins and ends. A possible way to do this is:
It appears that the dataset contains measurements of the years 2012, 2013 and 2014 (three years).
It is possible to adjust the seasonal period considered by gg_season (if applicable to the data) through the argument period. For example, we may consider a daily period as follows:
vic_elec %>%gg_season(Demand, period ="day") +theme(legend.position ="none") +labs(y="MWh", title="Electricity demand: Victoria")
Question: what does each line represent in the data?
Question: how many datapoints do we have?
Question: how many lines do we have in our graph?
Question how many data points per line do we have?
Solutions to be provided after class:
Weekly period example
vic_elec %>%gg_season(Demand, period ="week") +theme(legend.position ="none") +labs(y="MWh", title="Electricity demand: Victoria")
Question: what does each line represent in the data?
Question: how many weeks are there in a year?
Question: How many lines do we have in this data?
Question how many datapoints does each line have?
yearly perio example
vic_elec %>%gg_season(Demand, period ="year") +labs(y="MWh", title="Electricity demand: Victoria")
Question: what does each line represent?
Question: how many datapoints do we have per line?
Question: propose an alternatve, less noisy way of representing the data:
Seasonal subseries plot
Exercise seasonal plots
The aus_arrivals (loaded with the fpp3 library) data set comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.
Create a:
Timeplot of the arrivals using autoplot() (one curve per country of origin)
Seasonal plot using gg_season() (due to the structure of the data it will produce one graph per country of origin)
A gg_subseries() plot that shows the evolution of the quarterly arrivals per country over the time period considered (see next section on gg_subseries())
aus_arrivals
# A tsibble: 508 x 3 [1Q]
# Key: Origin [4]
Quarter Origin Arrivals
<qtr> <chr> <int>
1 1981 Q1 Japan 14763
2 1981 Q2 Japan 9321
3 1981 Q3 Japan 10166
4 1981 Q4 Japan 19509
5 1982 Q1 Japan 17117
6 1982 Q2 Japan 10617
7 1982 Q3 Japan 11737
8 1982 Q4 Japan 20961
9 1983 Q1 Japan 20671
10 1983 Q2 Japan 12235
# ℹ 498 more rows
aus_arrivals %>%autoplot(Arrivals)
Generally the number of arrivals to Australia is increasing over the entire series, with the exception of Japanese visitors which begin to decline after 1995.The series appear to have a seasonal pattern which varies proportionally to the number of arrivals. Interestingly, the number of visitors from NZ peaks sharply in1988. The seasonal pattern from Japan appears to change substantially.
The seasonal pattern of arrivals appears to vary between each country. In particular, arrivals from the UK appear to be lowest in Q2 and Q3, and increase substantially for Q4 and Q1. Whereas for NZ visitors, the lowest period of arrivals is in Q1, and highest in Q3. Similar variations can be seen for Japan and US.
aus_arrivals %>%gg_subseries(Arrivals)
The subseries plot reveals more interesting features. It is evident that whilst the UK arrivals is increasing, most of this increase is seasonal. More arrivals are coming during Q1 and Q4, whilst the increase in Q2 and Q3 is less extreme. The growth in arrivals from NZ and US appears fairly similar across all quarters. There exists an unusual spike in arrivals from the US in1992 Q3.Unusual observations:*2000 Q3: Spikes from the US (Sydney Olympics arrivals)*2001 Q3-Q4 are unusual forUS (9/11 effect)*1991 Q3 is unusual for the US (Gulf war effect?)
Seasonal subseries plots: gg_subseries()
Alternative plot to emphasize seasonal patterns. The data for each season are collected together in separate mini time plots, printing one plot for each of the seasonal divisions. All this is best understood through examples:
Example with monthly data
a10
# A tsibble: 204 x 3 [1M]
Month TotalC Cost
<mth> <dbl> <dbl>
1 1991 Jul 3526591 3.53
2 1991 Aug 3180891 3.18
3 1991 Sep 3252221 3.25
4 1991 Oct 3611003 3.61
5 1991 Nov 3565869 3.57
6 1991 Dec 4306371 4.31
7 1992 Jan 5088335 5.09
8 1992 Feb 2814520 2.81
9 1992 Mar 2985811 2.99
10 1992 Apr 3204780 3.20
# ℹ 194 more rows
a10 %>%gg_subseries(Cost) +labs(y ="$ (millions)",title ="Australian antidiabetic drug sales", )
Why does it produce this output?
Because of the date format: Year-Month.
The smallest time unit in the index is taken as unit in which the period is to be split. In this case month.
The period considered is the biggest possible period included in the time index. In this case, year.
As you can see, specifying “year” as the period is simply redundant in this case.
a10 %>%gg_subseries(Cost, period ="year") +labs(y ="$ (millions)",title ="Australian antidiabetic drug sales", )
Question: what will the output of this code be?
a10 %>%gg_subseries(Cost, period ="month") +labs(y ="$ (millions)",title ="Australian antidiabetic drug sales", )
Question what is the output of this code?
a10 %>%gg_subseries(Cost, period ="day") +labs(y ="$ (millions)",title ="Australian antidiabetic drug sales", )
Example with sub-daily data
NOTE: this example is not extremely important from a mathematical point of view, but rather to be able to understand the logic behind sub-seasonal plots.
Question: what would be the output of this code? DO NOT RUN THIS CELL. It might crash your computer.
vic_elec %>%gg_subseries(Demand) +labs(y ="Demand [MW]",title ="Half-hourly energy demand in the state of Victoria", )
It would take the greatest time unit in the index (year) and split it by the smallest time unit in the index (half-an hour). Since there are 365 * 24 * 2 = 17520 half-hours in a year, it would create 17520 graphs. This would be useless, unreadable and would probably crash your computer.
Takeaway: we need to be careful when we want to use the gg_subseries() function with sub-daily data.
A first possibility would be to specify period = “day”. This is a first option if we have sub-daily data because a day is the immediately superior time unit if we are dealing with hourly data.
Question: what is the output of this code?
vic_elec %>%gg_subseries(Demand, period ="day") +labs(y ="[MW]",title ="Total electricity demand for Victoria, Australia" )# Necessary to explore the output given the density of the dataggsave("subseasonal_demand_hourly_PeriodDay.svg", width =30, height =5)
Question: what is the output of this code?
vic_elec_m <- vic_elec %>%index_by(month =yearmonth(Time)) %>%summarise(avg_demand =mean(Demand) )vic_elec_m %>%gg_subseries(avg_demand, period ="year") +labs(y ="[MW]",title ="Total electricity demand for Victoria, Australia" )
Answer:12graphs (the period ="year" divided in the smallest time unit (month)
Question: what is the output of this code?
We could also aggregate by quartervic_elec_q <- vic_elec %>%index_by(quarter =yearquarter(Time)) %>%summarise(avg_demand =mean(Demand) )vic_elec_q %>%gg_subseries(avg_demand, period ="year") +labs(y ="[MW]",title ="Total electricity demand for Victoria, Australia" )
Answer:4graphs (period ="year" subdivided in the smallest unit)
Exercise - subseries plot
Aggregate the vic_elec data into daily data computing the average demand for each day using index_by() + summarise() combined with mean()
Create a gg_subseries() plot that has one graph for each day of the month and plots the values on each day for each month only for the year 2012
See expected output on the separate filesubseasonal_demand_daily_PeriodMonth.svg
See output in separate file "subseasonal_demand_daily_PeriodYear.svg"Answer:365graphs (Period = year subdivided in the smallest unit = days). This is utterly useless, of course. But it helps you understand the behaviorof gg_subseries()